2. Probability Theory for Thermodynamics#
Relevant readings and preparation
Concepts in Thermal Physics: Chapter 3: 3.1-3.8: pg. 20-28
Learning outcomes:
Distinguish between discrete and continuous probability distributions, and describe examples of each in physical contexts.
Define and calculate the variance and standard deviation of a distribution.
Understand how linear transformations affect the mean and variance of a random variable.
Describe the properties and applications of the binomial distribution in modelling discrete physical events.
Recognise the difference between independent and dependent probabilities, and compute joint or conditional probabilities accordingly.
Apply Bayes’ theorem to update probabilities in light of new evidence.
Identify and use the normal (Gaussian) distribution as an approximation to many natural phenomena.
Describe the Poisson distribution and its relevance to counting statistics and random thermal events.
Reality is filled with uncertainty. Every action or decision we take must be made with incomplete information, since the chain of events leading to an outcome is often so complex that the exact result is unpredictable. Nevertheless, we can still act with quantifiable confidence in an uncertain world: incomplete information is better than none at all. For example, it is more useful to know that there is a 20% chance of rain tomorrow than to have no forecast whatsoever. Probability is the mathematical framework that allows us to quantify uncertainty, and it is ubiquitous across all fields of scientific study, as well as in finance, software development, politics and so forth.
Probability theory has had an undeniably strong impact in furthering our understanding of thermal physics. This is because we often study systems containing a practically uncountable number of particles, where individual atomic behaviour is unpredictable but collective behaviour is remarkably regular. On macroscopic scales, probabilistic predictions become suitably precise. Measurable quantities such as temperature or pressure emerge as averages over many atomic contributions. Although each atom behaves differently, and tracking all atoms’ individual motions and collisions is an unfeasible feat, the ensemble’s average behaviour follows well-defined probability distributions, allowing us to perceive and model system behaviour without complete knowledge.
Before we delve into the basics of probability theory, we establish a few definitions:
Probabilities are non-negative numbers which take values between 0 and 1.
For a given scenario, all possible outcomes of that scenario form a set of events, with each outcome having an associated probability.
If an outcome is not part of this set, its probability of occurring is zero.
If the event is certain, the probability of it occurring is one.
Events are considered ‘mutually exclusive’ if they cannot occur simultaneously.
The sum of probabilities for all mutually exclusive outcomes must equal one for a valid probability distribution.
2.1. Discrete and Continuous probability distributions#
Discrete Distributions#
Discrete random variables can take only a countable set of distinct values. The classic example is the humble six-sided die, which has the set of outcomes: {1, 2, 3, 4, 5, 6}. If we denote by \(x\) a discrete random variable which takes values (i.e. outcomes) \(x_i\) with corresponding probabilities \(P_i\), we can make a few definitions which encapsulate the properties of discrete distributions.
First, we require that the sum of probabilities for every possible value of a discrete random variable equate to one:
We then define the arithmetic mean as the expected value of \(x\), denoted:
Intuitively, the idea is that each possible outcome’s contribution to the sum is adjusted by its probability of occurrence. This is called “weighting”. Were you to sample many times, add up all the outcomes you obtained and divide by the number of trials, you would eventually converge to the expected value. It is also possible to define the “mean squared” value of \(x\) through a similar procedure:
Do note that the expected value need not be present in the set of outcomes. A common example of this is the average number of children a family is expected to have across a population. These figures are often cited to occur between 1.8-2.4, yet it is only possible to have an integer number of children. These impossible values only make sense when considering a population rather than an individual sample.
Example: Expected value and mean squared
Consider a scenario where random variable \(x\) can take values {0, 1, 2} with corresponding probabilities {\(\frac{1}{2}\), \(\frac{1}{4}\), \(\frac{1}{4}\)}. This distribution is visualised in figure 2.1. Calculate the expected value for
(a) the variable, \(\langle x \rangle\)
(b) the mean squared of the variable, \(\langle x^2 \rangle\).
(a)
First check that \(\sum P_i = 1\). Since \(\frac{1}{2} + \frac{1}{4} + \frac{1}{4} = 1\) we are good to go. We then calculate the averages as follows:
We see that the mean \(\langle x \rangle\) is not one of the possible values \(x\) can take.
(b)
We follow a similar process for \(\langle x^2 \rangle\):
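As a quick sanity check, these weighted sums can be reproduced in a few lines of Python (a minimal sketch; the outcome and probability lists simply mirror the worked example):

```python
# Outcomes and probabilities from the worked example.
outcomes = [0, 1, 2]
probs = [1/2, 1/4, 1/4]

# A valid distribution must be normalised.
assert abs(sum(probs) - 1.0) < 1e-12

# Expected value <x> and mean squared <x^2> as probability-weighted sums.
mean_x = sum(x * p for x, p in zip(outcomes, probs))
mean_x2 = sum(x**2 * p for x, p in zip(outcomes, probs))

print(mean_x, mean_x2)  # 0.75 1.25
```

Note that the same loop works for any discrete distribution: only the two lists change.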
Continuous distributions#
Let \(x\) now be a continuous random variable, allowing it to take any value within specified bounds (these bounds could be infinite). We have to treat probabilities slightly differently here. Imagine a uniform distribution which, when sampled, can take any value between 1 and 10. One sample may yield exactly 4, another something extremely specific such as 3.1415926535… Since there are infinite values that can be taken in this range, each individual value has effectively zero probability of occurring! Therefore, when calculating probabilities for continuous distributions we consider the probability of a variable taking a value within a small interval between \(x\) and \(x + dx\).
There are lots of real-life quantities that exist on continuous distributions: height, commute durations, local temperature… These often have finite bounds, but there are infinitely many possibilities between them. We still enforce that the total probability of all values is one; however, as we are now summing across continuous ranges, we replace our sums with integrals:
Similarly, we have analogous formulae for \(\langle x \rangle\) and \(\langle x^2 \rangle\):
Uniform Distribution on [0, 10]
Let a continuous random variable \(x\) be uniformly distributed between 0 and 10:
What are the values of
(a) the expected value \(\langle x \rangle\)?
(b) the mean squared value \(\langle x^2 \rangle\)?
First, ensure that \(P(x)\) is a valid probability distribution by checking that it is normalised:
We can then compute the expected values:
(a)
(b)
The mean of 5 lies exactly halfway between the bounds, which is just as you’d expect for a symmetric uniform distribution.
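These integrals can also be checked numerically. The sketch below uses a simple midpoint-rule integration of the uniform density \(P(x) = 1/10\) (the grid size is an arbitrary choice):

```python
# Midpoint-rule check of <x> and <x^2> for a uniform density on [0, 10].
a, b, n = 0.0, 10.0, 100_000
dx = (b - a) / n
P = 1.0 / (b - a)  # uniform probability density

mean_x = sum((a + (i + 0.5) * dx) * P * dx for i in range(n))
mean_x2 = sum((a + (i + 0.5) * dx) ** 2 * P * dx for i in range(n))

print(round(mean_x, 4), round(mean_x2, 4))  # expect 5.0 and 100/3 ~ 33.3333
```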
Exponential Lifetime Distribution
Now consider a physical system whose lifetime follows an exponential distribution, such as the lifetime of a radioactive nucleus. The probability density of its lifetime is given by
where \(\lambda\) is the decay rate constant. Calculate for this distribution:
(a) \(\langle t \rangle\)
(b) \(\langle t^2 \rangle\)
(a)
First, confirm normalisation:
Then compute the averages:
(b)
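The exponential averages can be verified numerically as well; in the sketch below the decay rate \(\lambda = 2\) and the integration grid are arbitrary illustrative choices:

```python
import math

# Midpoint-rule check of normalisation, <t> and <t^2> for
# P(t) = lam * exp(-lam * t) on t >= 0, with an illustrative lam.
lam = 2.0
T = 40.0 / lam           # integrate far enough into the exponential tail
n = 200_000
dt = T / n

norm = mean_t = mean_t2 = 0.0
for i in range(n):
    t = (i + 0.5) * dt
    w = lam * math.exp(-lam * t) * dt   # probability weight of this slice
    norm += w
    mean_t += t * w
    mean_t2 += t * t * w

print(round(norm, 4), round(mean_t, 4), round(mean_t2, 4))
# expect 1.0, 1/lam = 0.5 and 2/lam**2 = 0.5
```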
2.2. Measures of Central Tendencies#
When describing a probability distribution, we often want to identify simple central values which best represent the properties and sampling behaviour of a probability distribution. These are known as measures of central tendency, and the most common are the mean, median, and mode:
Mean ⟨x⟩: the expectation or average value of a distribution.
Median: the value that divides the distribution into two equal halves, i.e. the middle value.
Mode: the most probable value, where the probability is maximal.
For symmetric distributions (like the Gaussian), these three measures coincide. For asymmetric or skewed distributions, they differ, and thus provide valuable information for characterising a distribution.
In skewed distributions, like in the positively skewed distribution in Figure 2.2, the mean is pulled towards the distribution’s tail whilst the median lies between the mean and mode. Regardless, the mode still represents the most probable value, i.e. the distribution’s peak.
Expectations of a function#
The expectation value can be taken with respect to any arbitrary function, \(f(x)\):
Variance#
The variance measures the average squared deviation of values from the mean of a distribution and is always non-negative. It is defined as follows:
We can expand the above expression to derive a simpler expression for calculating the variance of a scalar variable:
which is often more practical for calculations by hand. This way of expressing the variance gives rise to a mnemonic with which you can memorise the formula: the variance is “the mean of the square minus the square of the mean.” The variance of a variable \(x\) is often denoted \(\sigma^2_x\), and relates to the standard deviation, which is simply the square root of the variance:
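Both forms of the variance formula can be checked against the discrete distribution from the earlier worked example (a minimal sketch in Python):

```python
import math

# Variance two ways for outcomes {0, 1, 2} with probabilities {1/2, 1/4, 1/4}:
# the definition <(x - <x>)^2> and the shortcut <x^2> - <x>^2.
outcomes = [0, 1, 2]
probs = [1/2, 1/4, 1/4]

mean_x = sum(x * p for x, p in zip(outcomes, probs))
var_def = sum((x - mean_x) ** 2 * p for x, p in zip(outcomes, probs))
var_shortcut = sum(x**2 * p for x, p in zip(outcomes, probs)) - mean_x**2

sigma = math.sqrt(var_shortcut)  # standard deviation
print(var_def, var_shortcut)  # 0.6875 0.6875
```

Both routes agree, as the algebraic expansion guarantees.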
2.3. Linear Transformations#
A linear transformation is a mathematical rule which maps one variable onto another through a combination of scaling and shifting. For a scalar it takes the general form \(y = ax + b\), where \(a\) and \(b\) are constants:
The constant \(a\) controls the scaling of the variable; how much the values are expanded or compressed.
The constant \(b\) controls the translation of the variable; i.e. how far the values are moved up or down the axis.
In graphical terms, if \(x\) is represented on the horizontal axis and \(y\) on the vertical, the transformation \(y = a x + b\) produces a straight line whose slope is \(a\) and intercept is \(b\).
Example: Inches to centimetres
The conversion from inches to centimetres, or vice versa, is a linear transformation. If we take \(x\) to be a length in inches, then to find the value \(y\) in centimetres, we simply use the scale factor \(a = 2.54\) centimetres per inch and an intercept \(b = 0\):
Expectation under a linear transformation#
Because the expectation operator \(\langle \cdot \rangle\) is linear, its behaviour on a variable undergoing a linear transformation is simple:
This is because constants factor out of an expectation, and the expectation of a sum is the sum of expectations. Broken down further:
This is because the expected value of a constant is simply the constant itself. Thus a linear transformation acts linearly on the original expectation.
Variance under a linear transformation#
Variance describes the spread of a distribution about its mean. For a variable \(x\) transformed linearly as \(y = a x + b\), we can derive how the variance changes.
Starting from the definition,
and substituting \(y = a x + b\),
The additive constant \(b\) cancels out because it merely shifts the entire distribution without changing its spread. The multiplicative constant \(a\) scales the variance by \(a^2\), since stretching or compressing the variable by \(a\) changes the width of the distribution quadratically. These relationships underpin how measurement uncertainty propagates through unit conversions or calibration - when a physical quantity is scaled by \(a\), its variance (and hence uncertainty) scales by \(a^2\).
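These two rules can be illustrated with a short Monte Carlo experiment (the sample distribution and constants are arbitrary choices; \(a = 2.54\) echoes the unit-conversion example):

```python
import random
import statistics

# Monte Carlo check that y = a*x + b has mean a*<x> + b and variance a**2 * var(x).
random.seed(42)
a, b = 2.54, 1.0
xs = [random.random() for _ in range(200_000)]   # x uniform on [0, 1)
ys = [a * x + b for x in xs]

mx, vx = statistics.fmean(xs), statistics.pvariance(xs)
my, vy = statistics.fmean(ys), statistics.pvariance(ys)

print(round(my - (a * mx + b), 9))  # ~0: the mean transforms linearly
print(round(vy - a**2 * vx, 9))     # ~0: b cancels, a enters squared
```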
2.4. Independent Variables in Probability#
In many physical problems we deal with more than one random variable; for example, the position and velocity of a molecule, or the outcomes of repeated measurements of the same quantity. Sometimes these values are independent of one another, meaning that knowledge of one variable provides no information about the other. A standard example of this is coin flips: knowing the outcome of one coin flip tells you nothing about what the next flip will yield.
Formally, two random variables \(u\) and \(v\) are said to be independent if their joint probability distribution can be written as a product of their individual distributions:
thus the value taken by \(u\) does not depend on \(v\) and vice versa. This extends to any number of independent variables, i.e. \(\vec{x} \in \mathbb{R}^N\) is described by:
A useful consequence of independence is that the expectation value of a product of independent variables factorises neatly:
We can achieve this because the integrals for independent variables can be separated. This implies that the average value of the product of \(u\) and \(v\) is equal to the product of their average values. Again, this can be extended to any arbitrary number of independent variables:
The same is true for discrete variables, or even a combination of discrete and continuous variables. Simply replace integral signs with sums for discrete components, but otherwise the product of expectation values is equivalent for these combinations of variable type.
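A quick simulation makes the factorisation tangible (the two distributions are arbitrary choices for illustration):

```python
import random
import statistics

# For independent u and v, <u v> should equal <u><v>.
random.seed(0)
n = 200_000
us = [random.uniform(0, 2) for _ in range(n)]   # <u> = 1
vs = [random.gauss(3, 1) for _ in range(n)]     # <v> = 3

mean_uv = statistics.fmean(u * v for u, v in zip(us, vs))
product = statistics.fmean(us) * statistics.fmean(vs)

print(round(mean_uv, 3), round(product, 3))  # both close to 1 * 3 = 3
```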
Example: Mean and variance
Suppose that there are \(n\) independent variables, \(X_i\), each with the same mean \(\langle X \rangle\) and variance \(\sigma^2_X\). Let \(Y\) be the sum of the random variables, such that \(Y = X_1 + X_2 \ + \ ... \ + \ X_n\). Find
(a) the mean of Y
(b) the variance of Y
(a)
The mean of \(Y\) is simply the sum of each variable’s expectation value. As each variable \(X_i\) has the same mean \(\langle X \rangle\), we have:
(b)
Finding the variance of \(Y\) is a marginally more complicated matter. To start, let’s refer to the formula \(\sigma^2_Y = \langle Y^2 \rangle - \langle Y \rangle^2\). Seeing as we have \(\langle Y \rangle^2 = n^2 \langle X \rangle^2\), we only need to calculate \(\langle Y^2 \rangle\):
There are \(n\) terms like \(\langle X_1^2 \rangle\) on the right-hand side, and \(n(n-1)\) terms like \(\langle X_1 X_2 \rangle\). The former takes the value \(\langle X^2 \rangle\) and the latter \(\langle X \rangle \langle X \rangle = \langle X \rangle^2\). Therefore:
and,
This tells us that if we make \(n\) independent measurements of the same quantity, and then take their average via \(Y/n\), the uncertainty in that average is reduced by a factor \(\sqrt{n}\) compared to a single measurement:
This principle - that averaging many independent measurements reduces random error - underlies much of experimental physics. However, it applies only to random, uncorrelated errors: any systematic bias in a measurement setup will persist regardless of repetitions. A related idea appears in the study of random walks, such as the motion of a particle buffeted by molecules in a fluid. Each step (period of motion) is independent of the last, so whilst the average displacement after many steps is zero, the root-mean-square displacement grows as \(\sqrt{n}\).
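The \(\sqrt{n}\) reduction can be demonstrated by simulation. The sketch below (with arbitrary choices of \(\sigma\), \(n\) and trial count) repeatedly averages \(n\) noisy measurements and compares the spread of the averages with the predicted \(\sigma/\sqrt{n}\):

```python
import random
import statistics

# Spread of the average of n independent measurements vs. sigma / sqrt(n).
random.seed(1)
sigma_single, n, trials = 2.0, 25, 20_000

averages = [
    statistics.fmean(random.gauss(10.0, sigma_single) for _ in range(n))
    for _ in range(trials)
]
sigma_avg = statistics.pstdev(averages)

print(round(sigma_single / n**0.5, 3), round(sigma_avg, 3))
# predicted 0.4; the simulated spread should be close to it
```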
2.5. The Binomial Distribution#
Consider an experiment with only two possible outcomes, success and failure. This is called a Bernoulli trial.
Let the probability of success be p.
Then the probability of failure is (1 − p).
If we assign the value 1 to a success and 0 to a failure, the expectation and variance of a single trial are:
Now imagine performing \(n\) independent Bernoulli trials; for example, flipping a coin \(n\) times where we count \(k\) heads. Let us consider “heads” as a success.
There are two ingredients to the probability of obtaining k successes:
The probability of one specific sequence with k successes and n − k failures:
The number of possible sequences with exactly k successes:
Multiplying these gives the binomial probability of observing k successes given n trials:
2.6. Properties#
From the definition, one can show that the probability of all combinations of successes and failures sums to unity:
so the probabilities are properly normalised. The mean and variance of k are:
As n increases, both the mean and standard deviation grow, but the fractional width
decreases, meaning the distribution becomes more sharply peaked around k = np.
Example - coin tossing
For a fair coin, \(p = \frac{1}{2}\).
For 16 tosses, expected heads: \(\langle k \rangle = 8\); \(\sigma = 2\).
For \(10^{20}\) tosses, expected heads: \(\langle k \rangle = 5\times10^{19}\); \(\sigma = 5\times10^{9}\) - ten orders of magnitude smaller relative to the mean.
Thus, as the number of trials increases, the relative fluctuations around the mean gradually become negligible.
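The 16-toss figures quoted above follow directly from the binomial formulae, and for small \(n\) they can be checked exhaustively:

```python
import math

# Binomial distribution for n = 16 tosses of a fair coin (p = 1/2).
n, p = 16, 0.5
pmf = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

total = sum(pmf)                                   # should be 1 (normalised)
mean_k = sum(k * P for k, P in enumerate(pmf))     # should be n*p = 8
var_k = sum(k**2 * P for k, P in enumerate(pmf)) - mean_k**2  # n*p*(1-p) = 4

print(total, mean_k, math.sqrt(var_k))  # 1.0 8.0 2.0
```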
Example: 1D random walk
A one-dimensional random walk can be viewed as successive Bernoulli trials: each step is either forward (+L) or backward (−L) with equal probability \(p = \frac{1}{2}\).
After n steps, if k are forward, the net displacement is
Using the binomial results,
showing that the root-mean-square displacement grows as \(\sqrt{n}\) - a key result that later links to Brownian motion.
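This \(\sqrt{n}\) growth is easy to see in a simulation of many independent walks (step length \(L = 1\); the step and walk counts are arbitrary choices):

```python
import math
import random
import statistics

# Simulate many 1D random walks of n steps, each step +1 or -1 with p = 1/2.
random.seed(7)
n_steps, walks = 400, 5_000

displacements = [
    sum(random.choice((-1, 1)) for _ in range(n_steps)) for _ in range(walks)
]
mean_disp = statistics.fmean(displacements)
rms_disp = math.sqrt(statistics.fmean(d * d for d in displacements))

print(round(mean_disp, 2), round(rms_disp, 2))
# mean ~0; rms close to sqrt(400) = 20
```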
2.7. Conditional Probabilities - Bayes’ Theorem#
In many physical problems, events are not independent. More often than not, the probability of one event can depend on whether another has occurred. The framework in which we describe dependent events is called “conditional probability”.
Conditional probability#
If we have two events, A and B, the probability of A given that B has occurred is written:
provided \(P(B) \neq 0\). Here, \(P(A \cap B)\) means the probability that both A and B occur.
When events are independent, the occurrence of B does not affect A, so
and equivalently \(P(A \cap B) = P(A)P(B)\).
Bayes’ theorem#
Rearranging the definition of conditional probability gives an extremely useful result known as Bayes’ theorem:
This allows us to update our belief about A after observing B. In other words, it connects the prior probability \(P(A)\) with the posterior probability \(P(A|B)\).
Example: medical testing
A common example used to understand Bayes’ theorem is testing for disease. Suppose a disease affects 1% of a population. A test correctly identifies it 99% of the time, but also gives a 5% false-positive rate. In other words, if a person has the disease, the test will return a positive result 99% of the time, and a negative result 1% of the time. Meanwhile, if the person does not have the disease, it will return a negative result 95% of the time, and a positive result 5% of the time.
Let A = “person has disease” and B = “test is positive”. Then:
\(P(A) = 0.01\)
\(P(B|A) = 0.99\)
\(P(B|\neg A) = 0.05\)
The overall probability of obtaining a positive test result must factor in both means of obtaining that result:
Applying Bayes’ theorem:
So even with a positive result, there’s only a 17% chance the person actually has the disease.
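The arithmetic of this example is compact enough to script directly; the sketch below simply encodes the numbers given above:

```python
# Bayes' theorem for the medical-testing example.
P_disease = 0.01             # prior: 1% of the population has the disease
P_pos_given_disease = 0.99   # sensitivity: P(positive | disease)
P_pos_given_healthy = 0.05   # false-positive rate: P(positive | no disease)

# Total probability of a positive test (both routes to a positive result).
P_pos = (P_pos_given_disease * P_disease
         + P_pos_given_healthy * (1 - P_disease))

# Posterior: probability of having the disease given a positive result.
P_disease_given_pos = P_pos_given_disease * P_disease / P_pos
print(round(P_disease_given_pos, 4))  # 0.1667
```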
This illustrates how rare events can strongly influence conditional probabilities. Follow-up testing helps to reduce the uncertainty of a result: the probability of obtaining several positive test results in a row, given that the person does not have the disease, is astoundingly small, so repeated positives are strong evidence that the person does in fact have the disease.
2.8. Key Distributions - Gaussian, Poisson, Maxwell-Boltzmann#
Gaussian (Normal) Distribution#
A continuous, bell-shaped probability density function:
The mean and variance are \(\langle x\rangle=\mu\) and \(\sigma^2\). By the Central Limit Theorem, the sum or average of many independent random variables tends toward a Gaussian distribution, regardless of the original shape.
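The Central Limit Theorem can be glimpsed with a short simulation: sums of \(n\) independent uniform variables have mean \(n\mu\) and variance \(n\sigma^2\), and roughly 68% of them fall within one standard deviation of the mean, as a Gaussian predicts (sample sizes are arbitrary choices):

```python
import math
import random

# Sums of n uniform [0, 1) variables: mean n/2, variance n/12, and a
# near-Gaussian shape by the Central Limit Theorem.
random.seed(3)
n, trials = 50, 20_000
sums = [sum(random.random() for _ in range(n)) for _ in range(trials)]

mu = n * 0.5               # exact mean of each sum
sigma = math.sqrt(n / 12)  # exact standard deviation of each sum
within = sum(abs(s - mu) <= sigma for s in sums) / trials

print(round(within, 3))  # close to 0.683 for a Gaussian
```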
2.9. Poisson Distribution#
A discrete distribution describing the number of rare, independent events occurring in a fixed interval:
The mean and variance are both \(\langle k\rangle=\lambda\). It arises as the limit of the binomial distribution for large \(n\), small \(p\), with \(\lambda = np\).
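This limiting behaviour can be checked numerically by comparing the two probability mass functions for large \(n\) and small \(p\) (the values \(\lambda = 3\), \(n = 10000\) are arbitrary illustrative choices):

```python
import math

# Binomial(n, p) vs Poisson(lam) for lam = n*p, with large n and small p.
lam, n = 3.0, 10_000
p = lam / n

binom = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(6)]
poisson = [lam**k * math.exp(-lam) / math.factorial(k) for k in range(6)]

for k in range(6):
    print(k, round(binom[k], 5), round(poisson[k], 5))  # nearly identical
```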
2.10. Maxwell–Boltzmann Speed Distribution (3D Ideal Gas)#
The Maxwell-Boltzmann distribution is a probability distribution named after James Clerk Maxwell and Ludwig Boltzmann, first formulated to describe particle speeds in idealised gases.
Key quantities:
Most probable speed: \(v_p=\sqrt{\frac{2k_B T}{m}}\)
Mean speed: \(\langle v\rangle=\sqrt{\frac{8k_B T}{\pi m}}\)
RMS speed: \(v_{\mathrm{rms}}=\sqrt{\frac{3k_B T}{m}}\)
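To make these characteristic speeds concrete, here is a sketch evaluating all three for one illustrative case - nitrogen molecules at room temperature (the molecular mass and temperature are assumed values, not from the text):

```python
import math

k_B = 1.380649e-23   # Boltzmann constant, J/K
m = 4.65e-26         # mass of one N2 molecule, kg (assumed illustrative value)
T = 300.0            # temperature, K (assumed illustrative value)

v_p = math.sqrt(2 * k_B * T / m)                  # most probable speed
v_mean = math.sqrt(8 * k_B * T / (math.pi * m))   # mean speed
v_rms = math.sqrt(3 * k_B * T / m)                # rms speed

# The ordering v_p < <v> < v_rms holds for any m and T,
# since the prefactors 2 < 8/pi < 3 fix the ratios.
print(round(v_p), round(v_mean), round(v_rms))  # roughly 422 476 517 (m/s)
```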